Non-record: ASQU activation, Mixture of Convolutions, BankedLinear #679
Open
andrewmouldon wants to merge 7 commits into openai:main from
Conversation
This PR explores architectural changes aimed at improving capacity per parameter, even if they are slower than typical leaderboard approaches:
Ablations use the base train_gpt.py script for a fixed 10k steps. The MLP expansion factor is adjusted to match parameter count.
Results
Results are currently single-seed (1337); additional runs in progress.
ASQU: the per-channel parameterization gives a ~0.001 bpb improvement over the scalar form; the scalar form converges to behavior similar to leaky ReLU^2 with slope 0.5.
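The PR does not spell out the ASQU formula, but the observation above (scalar form converging to leaky ReLU^2 with slope 0.5) suggests a squared activation with a learnable negative-branch slope. A minimal sketch under that assumption — the class name, initialization, and exact functional form here are guesses, not the PR's implementation:

```python
import torch
import torch.nn as nn


class ASQU(nn.Module):
    """Hypothetical sketch: squared activation with a learnable slope on the
    negative branch, either one slope per channel or a single scalar."""

    def __init__(self, dim: int, per_channel: bool = True):
        super().__init__()
        shape = (dim,) if per_channel else (1,)
        # init at 0.5, the value the scalar form reportedly converges toward
        self.slope = nn.Parameter(torch.full(shape, 0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # positive branch: x^2; negative branch: (slope * x)^2,
        # i.e. a squared leaky-ReLU-like nonlinearity
        return torch.where(x > 0, x, self.slope * x) ** 2
```

With `per_channel=False` this reduces to the scalar variant the bullet compares against.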
Also explored learning the exponent instead of fixing it at 2. While this did not consistently improve final performance (and was more expensive), it revealed a consistent depth-dependent pattern:
With MoC, generating the dynamic kernel via a learned projection performed poorly, suggesting that a more constrained mechanism (e.g. basis interpolation) is necessary for stable optimization.
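One way to read "basis interpolation" here: instead of projecting features directly to raw kernel values, the dynamic kernel is a softmax-weighted combination of a small fixed bank of learned kernels, so optimization only has to learn mixing weights. A sketch under that assumption — the module name, shapes, and pooling choice are illustrative, not from the PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoCConv1d(nn.Module):
    """Hypothetical sketch: dynamic depthwise conv whose kernel is interpolated
    from a learned basis, rather than emitted by an unconstrained projection."""

    def __init__(self, dim: int, kernel_size: int = 3, n_basis: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        self.bank = nn.Parameter(torch.randn(n_basis, dim, kernel_size) * 0.02)
        self.to_logits = nn.Linear(dim, n_basis)  # per-example mixing logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); pool over time to get one mixture per example
        w = F.softmax(self.to_logits(x.mean(dim=1)), dim=-1)   # (B, n_basis)
        k = torch.einsum('bn,ndk->bdk', w, self.bank)          # (B, dim, K)
        xc = x.transpose(1, 2)                                 # (B, dim, T)
        B, D, T = xc.shape
        # grouped-conv trick: fold batch into channels so each example
        # is convolved with its own interpolated kernel
        out = F.conv1d(xc.reshape(1, B * D, T),
                       k.reshape(B * D, 1, self.kernel_size),
                       padding=self.kernel_size // 2, groups=B * D)
        return out.reshape(B, D, -1).transpose(1, 2)
</antml````

The softmax confines the kernel to the convex hull of the bank, which is the kind of constraint the bullet argues is needed for stable optimization.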
For BankedLinear, one scalar weight per bank entry is used to construct the mixture. Experiments with per-head weights worsened performance.
Experiments with a layer-specific LoRA worsened performance compared to simply investing the capacity in the MLP.